Implement Glean-based crash pings in Fenix
Categories
(Firefox for Android :: Experimentation and Telemetry, enhancement)
Tracking
()
People
(Reporter: gsvelto, Assigned: afranchuk)
References
()
Details
Attachments
(1 file)
As per title, the idea is to implement the minimal crash ping introduced in bug 1790569. While the ping will have only a small amount of functionality it will allow us to gauge Fenix actual crash rate. It will also be improved in lockstep with the desktop version so that we maintain feature equality between the two.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 1•2 years ago
•
|
||
I'm becoming familiar with the Firefox for Android code, and it seems like there's a number of possible scopes for this change.
-
GeckoView: adding crash pings here may best achieve the desired outcome, depending on how you look at it. We could get crash pings from anyone using GeckoView (which may be projects other than Fenix). On one hand, this would cover all Gecko crashes, but on the other, some people using GeckoView may not like telemetry being built-in. However, it could be opt-in. I expect the only crashes that could be reported would be non-fatal ones.
It's worth noting that if we want to capture other (non-Gecko) crashes in Fenix, maybe those could be reported separately (e.g. as a separate crash "product" than GeckoView), which may lend itself better to debugging/metrics tracking.
geckoview
doesn't currently contain Glean, howevergeckoview-omni
does. -
Android Components lib-crash: the crash library already supports Sentry and Socorro as services, so adding a Glean service here would be appropriate. This is also where the crash count Glean metric is already integrated.
One might consider this equivalent to (1) in some regards, since an app writer (and Fenix) may opt-in by using lib-crash (as long as they're aware it exists).
-
Fenix: adding crash pings directly to Fenix would be fairly straightforward. The codebase already uses Glean, sends a number of pings, and has existing logic managing crash events. However, currently crash reporting in Fenix is done through (2), so putting it directly in the Fenix code is likely not the best approach.
Reporter | ||
Comment 2•2 years ago
|
||
My take on this is that 2. is the best choice. Crash pings are per-product so they should not be in GeckoView as that's just the foundation for building a browser (or an app that's not a browser!). Android Components is a good fit because it already has all the plumbing in place and is used by Fenix - but other apps based on GeckoView won't necessarily use it which is fine. That being said the final choice rests on the Fenix team so we'll have to ask for some feedback directly from them.
Assignee | ||
Comment 3•2 years ago
|
||
Okay, I expect they'll agree with your opinion, seeing as that's where all the other crash ping/reporting is handled. I wasn't sure whether GeckoView itself is viewed as an individual product, which is why I brought up the possibility of getting crash information directly from it.
I suppose if someone were using GeckoView and experiencing crashes, our first response (assuming they filed a bug) might be to advise them to add lib-crash so that we can get a better idea of what is going on at the moment of the crash (if it isn't clear enough already).
Comment 4•2 years ago
|
||
There already is a Glean crash reporting service defined in lib-crash that handles the messier parts of making things work with Glean running in another process.
I would love to see the lib-crash Glean stuff expanded and/or modified to suit the purposes of this bug, please do let me know if I may be of assistance.
Assignee | ||
Comment 5•2 years ago
|
||
I saw that in the README; I'm glad there's some precedent and existing code to handle that edge case. It seems like that's the most appropriate place for it.
Reporter | ||
Comment 6•2 years ago
|
||
FYI in case it might be useful this is the code that assembled the ping in Fennec:
- https://hg.mozilla.org/mozilla-central/file/3c6468b1512e/mobile/android/geckoview/src/main/java/org/mozilla/gecko/CrashReporterService.java
- https://hg.mozilla.org/mozilla-central/file/3c6468b1512e/mobile/android/geckoview/src/main/java/org/mozilla/gecko/CrashHandler.java
It was intermingled with the crash reporting code. The bits that invoked the stack walker were here (just for reference as you won't need them for now, and besides we'll probably do things differently):
- https://hg.mozilla.org/mozilla-central/file/3c6468b1512e/mobile/android/geckoview/src/main/java/org/mozilla/gecko/mozglue/MinidumpAnalyzer.java
- https://hg.mozilla.org/mozilla-central/file/3c6468b1512e/mozglue/android/nsGeckoUtils.cpp#l151
- https://hg.mozilla.org/mozilla-central/file/3c6468b1512e/toolkit/crashreporter/minidump-analyzer/minidump-analyzer.cpp#l420
Assignee | ||
Comment 7•2 years ago
|
||
I've temporarily copied the {metrics,pings}.yaml
files into my local copy (in lieu of bug 1811928) and have updated the GleanCrashReporterService
to persist/send the crash pings. I'm wondering whether this obsoletes the crash_metrics.crash_count
metric: is there any advantage to having the labeled_counter
when you could presumably aggregate the same results amongst submitted crash pings? Though worth noting: the crash pings currently don't have a metric which encompasses uncaught exceptions. Would we want to classify that as a process_type
(I'd say that's technically correct), or should that be reserved for gecko process types and we add another metric for this sort of distinction?
The work so far is nominal, since I can't get the glean parser to run correctly (but there are no other language check errors so it should compile after that is fixed). I see
Execution failed for task ':lib-crash:gleanGenerateMetricsDocsForDebug'.
> Glean documentation generation failed.
* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
which I'll dig into further if I can figure out where to poke the gradle build files to pass those suggested flags.
Assignee | ||
Updated•2 years ago
|
Comment 8•2 years ago
|
||
I'd keep crash_count
for a least a couple releases to ensure there's agreement. We might find some differences between measuring things two ways (well, we always do, but we might find meaningful differences) that we might've missed without something to compare against.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 10•2 years ago
|
||
Updated•2 years ago
|
Assignee | ||
Comment 11•2 years ago
|
||
The PR has landed.
Description
•